Outline

  • Wrangling using verbs
  • Plotting data using a grammar
  • Mapping for data analysis
  • Statistical inference with data visualisation
  • Perceptual principles for good graphic design
  • Representing uncertainty

dplyr verbs

There are five primary dplyr verbs, representing distinct data analysis tasks:

  • filter: Extract specified rows of a data frame
  • arrange: Reorder the rows of a data frame
  • select: Select specified columns of a data frame
  • mutate: Add new columns to, or transform existing columns of, a data frame
  • summarise: Create collapsed summaries of a data frame
  • (group_by: Introduce structure to a data frame)
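Of these, mutate is the only verb not demonstrated on the following slides; a minimal sketch, assuming dplyr is attached and using the french_fries data introduced next (off_flavour is a made-up column, purely for illustration):

```r
library(dplyr)
# mutate adds a new column computed from existing ones;
# off_flavour is a hypothetical combined score for illustration
french_fries |>
  mutate(off_flavour = rancid + painty) |>
  select(time, subject, rancid, painty, off_flavour) |>
  head()
```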

Filter

Data from an experiment on the taste of french fries cooked in three different oils over a 10-week period. Twelve people tasted two fries (replicates) from each batch.

Get a subset of the observations (horizontal slice)

load("data/french_fries.rda")
french_fries |>
    filter(subject == 3, time == 1) 
# A tibble: 6 × 9
  time  treatment subject   rep potato buttery grassy rancid
* <fct> <fct>     <fct>   <dbl>  <dbl>   <dbl>  <dbl>  <dbl>
1 1     1         3           1    2.9     0      0      0  
2 1     1         3           2   14       0      0      1.1
3 1     2         3           1   13.9     0      0      3.9
4 1     2         3           2   13.4     0.1    0      1.5
5 1     3         3           1   14.1     0      0      1.1
6 1     3         3           2    9.5     0      0.6    2.8
# ℹ 1 more variable: painty <dbl>
ff_long <- french_fries |> 
  pivot_longer(potato:painty, 
               names_to = "type", 
               values_to = "rating")

Arrange

Order the observations (hierarchically)

french_fries |>
    arrange(desc(rancid)) |> 
    head()
# A tibble: 6 × 9
  time  treatment subject   rep potato buttery grassy rancid
  <fct> <fct>     <fct>   <dbl>  <dbl>   <dbl>  <dbl>  <dbl>
1 9     2         51          1    7.3     2.3      0   14.9
2 10    1         86          2    0.7     0        0   14.3
3 5     2         63          1    4.4     0        0   13.8
4 9     2         63          1    1.8     0        0   13.7
5 5     2         19          2    5.5     4.7      0   13.4
6 4     3         63          1    5.6     0        0   13.3
# ℹ 1 more variable: painty <dbl>

Select

Select a subset of the variables (vertical selection)

french_fries |>
    select(time, treatment, subject, rep, potato) |> 
    head()
# A tibble: 6 × 5
  time  treatment subject   rep potato
  <fct> <fct>     <fct>   <dbl>  <dbl>
1 1     1         3           1    2.9
2 1     1         3           2   14  
3 1     1         10          1   11  
4 1     1         10          2    9.9
5 1     1         15          1    1.2
6 1     1         15          2    8.8

Summarise

Summarise observations into a (set of) one-number statistic(s)

french_fries |>
    summarise( 
      mean_rancid = mean(rancid, na.rm=TRUE), 
      sd_rancid = sd(rancid, na.rm = TRUE)
      ) 
# A tibble: 1 × 2
  mean_rancid sd_rancid
        <dbl>     <dbl>
1        3.85      3.78

Summarise and group_by

french_fries |>
    group_by(time, treatment) |>
    summarise(mean_rancid = mean(rancid), sd_rancid = sd(rancid))
# A tibble: 30 × 4
# Groups:   time [10]
   time  treatment mean_rancid sd_rancid
   <fct> <fct>           <dbl>     <dbl>
 1 1     1                2.76      3.21
 2 1     2                1.72      2.71
 3 1     3                2.6       3.20
 4 2     1                3.9       4.37
 5 2     2                2.14      3.12
 6 2     3                2.50      3.38
 7 3     1                4.65      3.93
 8 3     2                2.90      3.77
 9 3     3                3.6       3.59
10 4     1                2.08      2.39
# ℹ 20 more rows

Let’s use these tools

To answer these french fry experiment questions:

  • Is the design complete?
  • Are replicates like each other?
  • How do the ratings on the different scales differ?
  • Are raters giving different scores on average?
  • Do ratings change over the weeks?

Completeness

  • If the data is complete it should have 12 x 10 x 3 x 2 = 720 records, that is, 6 records (3 treatments x 2 replicates) for each person in each week.

  • To check: tabulate number of records for each subject, time and treatment. See n().

french_fries |> 
  group_by(subject) |> 
  summarize(n = n()) 
# A tibble: 12 × 2
   subject     n
   <fct>   <int>
 1 3          54
 2 10         60
 3 15         60
 4 16         60
 5 19         60
 6 31         54
 7 51         60
 8 52         60
 9 63         60
10 78         60
11 79         54
12 86         54

Four subjects (3, 31, 79, 86) have 54 records instead of 60, so each skipped a week.


Other nice short cuts

Instead of group_by(subject) |> summarize(n = n()) we can use:

  • group_by(subject) |> tally()
  • count(subject)

Counts for subject by time

options(width=120)
french_fries |>
  count(subject, time) |>
  pivot_wider(names_from="time", values_from="n")
# A tibble: 12 × 11
   subject   `1`   `2`   `3`   `4`   `5`   `6`   `7`   `8`   `9`  `10`
   <fct>   <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
 1 3           6     6     6     6     6     6     6     6     6    NA
 2 10          6     6     6     6     6     6     6     6     6     6
 3 15          6     6     6     6     6     6     6     6     6     6
 4 16          6     6     6     6     6     6     6     6     6     6
 5 19          6     6     6     6     6     6     6     6     6     6
 6 31          6     6     6     6     6     6     6     6    NA     6
 7 51          6     6     6     6     6     6     6     6     6     6
 8 52          6     6     6     6     6     6     6     6     6     6
 9 63          6     6     6     6     6     6     6     6     6     6
10 78          6     6     6     6     6     6     6     6     6     6
11 79          6     6     6     6     6     6     6     6     6    NA
12 86          6     6     6     6     6     6     6     6    NA     6

How do scores change over time?

ggplot(data=ff_long, aes(x=time, y=rating, colour=treatment)) +
  geom_point() +
  facet_grid(subject~type) 

Plot differently

Get summary of ratings over replicates and connect the dots

Code
ff_av <- ff_long |> 
  group_by(subject, time, type, treatment) |>
  summarise(rating=mean(rating))

ggplot(data=ff_long, aes(x=time, y=rating, colour=treatment)) + 
  facet_grid(subject~type) +
  geom_line(data=ff_av, aes(group=treatment))

With tidy data, wrangling is a logical and organised process.

Plotting data using a grammar

Grammar of graphics

What is a data plot?

  • data
  • aesthetics: mapping of variables to graphical elements
  • geom: type of plot structure to use
  • transformations: log scale, …
  • layers: multiple geoms, multiple data sets, annotation
  • facets: show subsets in different plots
  • themes: modifying style

ggplot(data = <DATA>) + 
  <GEOM_FUNCTION>(
     mapping = aes(<MAPPINGS>),
     stat = <STAT>, 
     position = <POSITION>
  ) +
  <COORDINATE_FUNCTION> +
  <FACET_FUNCTION> +
  <THEME>

Why use a grammar?

  • Change the script a little, make all sorts of types of plots.

  • Instead of plots being like unique animals in a zoo, the script is like having the genetic footprint to see how different plots are related.

  • There is a tight connection with statistical thinking. This makes it possible to do statistical inference and make valid decisions from data plots.
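As a small illustration with the ff_long data from the wrangling section: keep the data and mapping fixed, and changing only the geom produces quite different plots.

```r
# Same data and mapping; swapping one line changes the plot type
p <- ggplot(ff_long, aes(x = type, y = rating))
p + geom_boxplot()             # distribution summaries
p + geom_jitter(alpha = 0.2)   # all the raw ratings
p + geom_violin()              # density shapes
```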

Example: Tuberculosis data

(Current) TB case notification data from WHO, also available via the R package getTBinR. Here we have Indonesian incidence for 1995-2012.

Code
ggplot(tb_idn, aes(x = year, 
                  y = count, 
                  fill = sex)) +
  geom_bar(stat = "identity") +
  facet_grid(~age) 
  • What do you learn about tuberculosis incidence in Indonesia from this plot?
  • Give three changes to the plot that would improve it.

Compare males and females

Code
ggplot(tb_idn, aes(x=year, y=count, fill=sex)) +  
  geom_bar(stat="identity", position="fill") + 
  ylab("proportion") + 
  facet_grid(~age_group) +
  scale_x_continuous("year", breaks = seq(2000, 2012, 5), 
                     labels = c("00", "05", "10"))
  • What do we learn about the data that is different from the previous plot?
  • What is easier and what is harder or impossible to learn from this arrangement?

Separate plots

Code
# Make separate plots for males and females, focus on counts by category
ggplot(tb_idn, aes(x=year, y=count, fill=sex)) +
  geom_bar(stat="identity") +
  facet_grid(sex~age_group) + 
  scale_x_continuous("year", breaks=seq(2000, 2012, 5), 
                     labels=c("00", "05", "10"))

Focus is now separately on the temporal trend for each sex and age group. What do we see?

The trend was tapering off in the younger age groups, but it is steadily increasing for the over-65s.

Make a pie

Code
# How to make a pie instead of a barchart - not straightforward
ggplot(tb_idn, aes(x=year, y=count, fill=sex)) +
  geom_bar(stat="identity") + 
  facet_grid(sex~age_group) + 
  coord_polar() + 
  scale_x_continuous("year", breaks=seq(2000, 2012, 5), 
                     labels=c("00", "05", "10"))

This isn’t a pie, it’s a rose plot!

Stacked bar

Code
# Step 1 to make the pie
ggplot(tb_idn, aes(x = 1, y = count, fill = factor(year))) +
  geom_bar(stat="identity", position="fill") + 
  facet_grid(sex~age_group) +
  scale_fill_viridis_d("", option="inferno") 

Step 1: turn it into a stacked bar.

Pie chart

Code
# Now we have a pie, note the mapping of variables
# and the modification to the coord_polar
ggplot(tb_idn, aes(x = 1, y = count, fill = factor(year))) + 
  geom_bar(stat="identity", position="fill") + 
  facet_grid(sex~age_group) +
  scale_fill_viridis_d("", option="inferno") +
  coord_polar(theta = "y") 

  • What are the pros, and cons, of using the pie chart for this data?
  • Would it be better if the pies used age for the segments, and faceted by year (and sex)?
Code
# age for segments, facet by year and sex
ggplot(tb_idn, aes(x = 1, y = count, fill = factor(age_group))) + 
  geom_bar(stat="identity", position="fill") + 
  facet_grid(sex~year) + 
  scale_fill_viridis_d("", option="inferno") +
  coord_polar(theta = "y") +
  theme(legend.position="bottom")

Line plot vs barchart

Code
ggplot(tb_idn, aes(x=year, y=count, colour=sex)) +
  geom_line() + 
  geom_point() +
  facet_grid(~age_group) +
  ylim(c(0,NA)) + 
  scale_x_continuous("year", breaks=seq(2000, 2012, 5), 
                     labels=c("00", "05", "10"),
                     limits=c(2000, 2012))

  • We can read counts for both sexes
  • Males and females can be directly compared
  • Temporal trend is more visible

Line plot vs barchart

Code
tb_idn |> group_by(year, age_group) |> 
  summarise(p = count[sex=="m"]/sum(count)) |>
  ggplot(aes(x=year, y=p)) +
    geom_hline(yintercept = 0.50, colour="grey50", linewidth=2) +
    geom_line() + 
    geom_point() +
    facet_grid(~age_group) +
    ylab("Proportion of Males") + 
    scale_x_continuous("year", breaks=seq(2000, 2012, 5), 
                     labels=c("00", "05", "10"),
                     limits=c(2000, 2012))

  • Attention is focused on the proportion of males
  • Direct comparison of counts within year and age
  • Equal proportion guideline provides a baseline for comparison

Avoid “connect the dots” plots

Code
ggplot(tb_idn, aes(x=year, y=count, colour=sex)) +
  geom_point() +
  geom_smooth(se=F) + 
  facet_grid(~age_group) +
  ylim(c(0,NA)) + 
  scale_x_continuous("year", breaks=seq(2000, 2012, 5), 
                     labels=c("00", "05", "10"),
                     limits=c(2000, 2012))

  • Focus on the trend
  • Actual counts might vary a little
  • Temporal trend is even clearer

Avoid “connect the dots” plots

Code
tb_idn |> group_by(year, age_group) |> 
  summarise(p = count[sex=="m"]/sum(count)) |>
  ggplot(aes(x=year, y=p)) +
    geom_hline(yintercept = 0.50, colour="grey50", linewidth=2) +
    geom_point() +
    geom_smooth(se=F) + 
    facet_grid(~age_group) +
    ylab("Proportion of Males") + 
    scale_x_continuous("year", breaks=seq(2000, 2012, 5), 
                     labels=c("00", "05", "10"),
                     limits=c(2000, 2012))

  • Attention is on the trend in the proportion of males

With a small change in code, many different plots of the same data were generated. It allowed us to see different patterns and relationships in the data.

gg extensions

  • There are more than 150 extensions to ggplot2
  • Many adhere to the grammar, to define new types of displays, like
    • ggdist: including representation of uncertainty and error
    • gganimate: specifies animations as layers
  • Others are supporting packages, such as
    • patchwork for laying out multiple displays
    • ggthemes for styling plots

https://exts.ggplot2.tidyverse.org/gallery/

Exploring distributions and uncertainty

ggdist

Code
tb_inc_100k <- read_csv("data/TB_burden_countries_2025-07-22.csv") |>
  filter(iso3 %in% c("USA", "AUS"))
ggplot(tb_inc_100k, aes(y = iso3, 
                        x = e_inc_100k)) +
  stat_gradientinterval(fill = "darkorange") +
  ylab("") +
  xlab("Inc per 100k") +
  theme_ggdist()

Code
ggplot(tb_inc_100k, aes(y = iso3, 
                        x = e_inc_100k)) +
  stat_halfeye(side = "right") +
  geom_dots(side="left", 
                    fill = "darkorange", color = "darkorange") +
  ylab("") +
  xlab("Inc per 100k") +
  theme_ggdist()

Distribution, variation, uncertainty (1/3)

The most valuable way to show uncertainty is to show all the data.

The plot shows the first preference vote % for the Greens in the 2019 Australian Federal Election, for the 150 electorates.

The plot of choice is a jittered dotplot, where points are spread vertically according to density.

Code
election <- read_csv("data/election2019.csv",
  skip = 1,
  col_types = cols(
    .default = col_character(),
    OrdinaryVotes = col_double(),
    AbsentVotes = col_double(),
    ProvisionalVotes = col_double(),
    PrePollVotes = col_double(),
    PostalVotes = col_double(),
    TotalVotes = col_double(),
    Swing = col_double()
  )
)
e_grn <- election |>
  group_by(DivisionID) |>
  summarise(
    DivisionNm = unique(DivisionNm),
    State = unique(StateAb),
    votes_GRN = TotalVotes[which(PartyAb == "GRN")],
    votes_total = sum(TotalVotes)
  ) |>
  mutate(perc_GRN = votes_GRN / votes_total * 100)

e_grn |>
  mutate(State = fct_reorder(State, perc_GRN)) |>
  ggplot(aes(x=perc_GRN, y=State)) +
    geom_quasirandom(groupOnX = FALSE, varwidth = TRUE) +
    labs(
      x = "First preference votes %",
      y = ""
    ) +
  xlim(c(0,50))

Distribution, variation, uncertainty (2/3)

What do we learn?

  • Different number of observations in each state
  • One outlier in Vic
  • As a group, ACT has higher %’s
  • Vic has a small cluster of points with higher %’s
  • %’s are mostly very low

This plot ONLY shows uncertainty!

Distribution, variation, uncertainty (3/3)

What would be other common ways to display this data?

  • Side-by-side boxplots
  • Side-by-side violin
  • On a map of electorates

For each plot think about

  • what is the uncertainty, and what is the estimate
  • what the plot shows or hides

Code
e_grn |>
  mutate(State = fct_reorder(State, perc_GRN)) |>
  ggplot(aes(x=perc_GRN, y=State)) +
    geom_boxplot(varwidth = TRUE) +
    labs(
      x = "First preference votes %",
      y = ""
    ) +
  xlim(c(0,50))

Code
e_grn |>
  mutate(State = fct_reorder(State, perc_GRN)) |>
  ggplot(aes(x=perc_GRN, y=State)) +
    geom_violin(draw_quantiles = c(0.25, 0.5, 0.75),
      fill="#006dae", alpha=0.5) +
    labs(
      x = "First preference votes %",
      y = ""
    ) +
  xlim(c(0,50))

Code
oz_states <- ozmaps::ozmap_states |> filter(NAME != "Other Territories")
oz_votes <- rmapshaper::ms_simplify(ozmaps::abs_ced)
oz_votes_grn <- full_join(oz_votes, e_grn, by=c("NAME"="DivisionNm"))

ggplot(oz_votes_grn, aes(fill=perc_GRN)) +
  geom_sf(colour="white") +
  scale_fill_viridis_c(direction=-1, trans = "log", 
    guide = "colourbar", 
    labels = scales::label_number(accuracy = 0.1)) +
  theme_map() +
  theme(legend.position = "right", 
    legend.title = element_blank())

Models and uncertainty

Plotting the fitted model alone

Code
data("wages")
wages_fct <- wages |>
  select(id, ln_wages, xp, high_grade) |>
  mutate(high_grade = factor(high_grade))
wages_fit <- lmer(ln_wages~xp + high_grade + (xp|id), data=wages_fct)
wages_fe <- summary(wages_fit)$coefficients
wages_fe_d <- tibble(xp = rep(seq(0, 13, 1), 7),
     high_grade = rep(c(6, 7, 8, 9, 10, 11, 12), rep(14, 7))) |>
  mutate(ln_wages = case_when(
    high_grade == 6 ~ wages_fe[1,1] + wages_fe[2,1]*xp,
    high_grade == 7 ~ wages_fe[1,1] + wages_fe[3,1] + wages_fe[2,1]*xp,
    high_grade == 8 ~ wages_fe[1,1] + wages_fe[4,1]  + wages_fe[2,1]*xp,
    high_grade == 9 ~ wages_fe[1,1] + wages_fe[5,1]  + wages_fe[2,1]*xp,
    high_grade == 10 ~ wages_fe[1,1] + wages_fe[6,1]  + wages_fe[2,1]*xp,
    high_grade == 11 ~ wages_fe[1,1] + wages_fe[7,1]  + wages_fe[2,1]*xp,
    high_grade == 12 ~ wages_fe[1,1] + wages_fe[8,1]  + wages_fe[2,1]*xp)
  ) |>
  mutate(high_grade = factor(high_grade))
ggplot(wages_fe_d) + 
  geom_line(aes(x=xp, 
                y=ln_wages, 
                colour=high_grade, 
                group=high_grade)) +
  scale_colour_discrete_divergingx(palette = "Zissou 1") +
  labs(x="Experience (years)", y="Wages (ln)", colour="Grade") 

Adding the data makes the model look less impressive

Code
ggplot() + 
  geom_line(data=wages_fct, aes(x=xp, y=ln_wages, group=id), alpha=0.1) +
  geom_line(data=wages_fe_d, aes(x=xp, 
                y=ln_wages, 
                colour=high_grade, 
                group=high_grade)) +
  scale_colour_discrete_divergingx(palette = "Zissou 1") +
  labs(x="Experience (years)", y="Wages (ln)", colour="Grade") 

Standard errors as produced by the model fit.

Code
wages_fe_d <- wages_fe_d |>
  mutate(ln_wages_l = case_when(
    high_grade == 6 ~ wages_fe[1,1] - wages_fe[1,2] +
                      (wages_fe[2,1]-wages_fe[2,2])*xp ,
    high_grade == 7 ~ wages_fe[1,1] - wages_fe[1,2] + 
                      wages_fe[3,1] - wages_fe[3,2] + 
                      (wages_fe[2,1]-wages_fe[2,2])*xp,
    high_grade == 8 ~ wages_fe[1,1] - wages_fe[1,2] + 
                      wages_fe[4,1] - wages_fe[4,2] + 
                      (wages_fe[2,1]-wages_fe[2,2])*xp,
    high_grade == 9 ~ wages_fe[1,1] - wages_fe[1,2] + 
                      wages_fe[5,1] - wages_fe[5,2] + 
                      (wages_fe[2,1]-wages_fe[2,2])*xp,
    high_grade == 10 ~ wages_fe[1,1] - wages_fe[1,2] + 
                      wages_fe[6,1] - wages_fe[6,2] + 
                      (wages_fe[2,1]-wages_fe[2,2])*xp,
    high_grade == 11 ~ wages_fe[1,1] - wages_fe[1,2] + 
                      wages_fe[7,1] - wages_fe[7,2] + 
                      (wages_fe[2,1]-wages_fe[2,2])*xp,
    high_grade == 12 ~ wages_fe[1,1] - wages_fe[1,2] + 
                      wages_fe[8,1] - wages_fe[8,2] + 
                      (wages_fe[2,1]-wages_fe[2,2])*xp)
  ) |>
  mutate(ln_wages_u = case_when(
    high_grade == 6 ~ wages_fe[1,1] + wages_fe[1,2] +
                      (wages_fe[2,1]+wages_fe[2,2])*xp ,
    high_grade == 7 ~ wages_fe[1,1] + wages_fe[1,2] + 
                      wages_fe[3,1] + wages_fe[3,2] + 
                      (wages_fe[2,1]+wages_fe[2,2])*xp,
    high_grade == 8 ~ wages_fe[1,1] + wages_fe[1,2] + 
                      wages_fe[4,1] + wages_fe[4,2] + 
                      (wages_fe[2,1]+wages_fe[2,2])*xp,
    high_grade == 9 ~ wages_fe[1,1] + wages_fe[1,2] + 
                      wages_fe[5,1] + wages_fe[5,2] + 
                      (wages_fe[2,1]+wages_fe[2,2])*xp,
    high_grade == 10 ~ wages_fe[1,1] + wages_fe[1,2] + 
                      wages_fe[6,1] + wages_fe[6,2] + 
                      (wages_fe[2,1]+wages_fe[2,2])*xp,
    high_grade == 11 ~ wages_fe[1,1] + wages_fe[1,2] + 
                      wages_fe[7,1] + wages_fe[7,2] + 
                      (wages_fe[2,1]+wages_fe[2,2])*xp,
    high_grade == 12 ~ wages_fe[1,1] + wages_fe[1,2] + 
                      wages_fe[8,1] + wages_fe[8,2] + 
                      (wages_fe[2,1]+wages_fe[2,2])*xp)
  ) 

ggplot() + 
  geom_ribbon(data=wages_fe_d, 
            aes(x=xp, 
                ymin=ln_wages_l,
                ymax=ln_wages_u,
                fill=high_grade), colour=NA, alpha=0.1) +
  geom_line(data=wages_fe_d, 
            aes(x=xp, 
                y=ln_wages, 
                colour=high_grade, 
                group=high_grade)) +
  scale_fill_discrete_divergingx(palette = "Zissou 1") +
  scale_colour_discrete_divergingx(palette = "Zissou 1") +
  labs(x="Experience (years)", y="Wages (ln)", colour="Grade", fill="Grade") 

Examine individual fits. There are too many to show at once, so sample repeatedly to work through them.

Code
wages_full <- wages_fct |>
  add_predictions(wages_fit, 
                  var = "pred") |>
  add_residuals(wages_fit, 
                var = "res")
set.seed(1222)
wages_full |> add_n_obs() |> filter(n_obs > 4) |>
  sample_n_keys(size = 12) |>
  ggplot() + 
  geom_line(aes(x = xp, y = pred, group = id, 
             colour = factor(id))) + 
  geom_point(aes(x = xp, y = ln_wages, 
                 colour = factor(id))) + 
  facet_wrap(~id, ncol=4)  +
  scale_x_continuous("Experience (years)", 
    breaks=seq(0, 12, 2)) +
  ylab("Wages (ln)") +
  theme(aspect.ratio = 0.6, legend.position = "none")

  • Plotting the model alone suggests higher education increases wages (by XXX), particularly if the student has 12 years (full high school).
  • Adding the individual-level observations shows the large individual-to-individual variability, which swamps the education differences. It also suggests the individual trajectories might not increase linearly.
  • Adding a representation of the standard errors shows the “overlap” between education levels: the estimated wage increase with 12 years of education is not substantially better than with 10 years, though it is better than with 6 years.
  • Displaying the observed and fitted values separately for each individual examines individual-level uncertainty.

Note that this particular model fit assumes wages increase linearly with years of experience.
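If that assumption is in doubt, one way to probe it is to refit with a curvature term and compare; a sketch under the same variable names as above (not part of the original analysis):

```r
library(lme4)
# Allow a quadratic experience effect in the fixed part of the model
wages_fit_q <- lmer(ln_wages ~ poly(xp, 2) + high_grade + (xp | id),
                    data = wages_fct)
# A lower AIC would suggest the extra flexibility is warranted
AIC(wages_fit, wages_fit_q)
```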

Common measures and representations

Melbourne pedestrian counts at Southern Cross Station, Sunday Aug 31, 2025.

load("data/ped_Aug2025.rda")
ped_sc <- ped |>
  filter(Sensor == "Southern Cross Station") |>
  filter(Date == ymd("2025-08-31")) |>
  group_by(Time) |>
  summarise(Count = sum(Count), .groups = "drop") |>
  mutate(se = sqrt(Count))
b1 <- ggplot(ped_sc, aes(x=Time, y=Count)) +
  geom_col(fill = "#20794D") +
  xlab("Hour")
b2 <- ggplot(ped_sc, aes(x=Time, y=Count)) +
  geom_col(fill = "#b9ca4a") +
  geom_errorbar(aes(ymin = Count - se, ymax = Count + se),
    width=0.5, colour="#20794D") +
  xlab("Hour")
b3 <- ggplot(ped_sc, aes(x=Time,
    ydist=distributional::dist_normal(Count, se))) +
  stat_pointinterval(colour = "#20794D") +
  xlab("Hour") + ylab("Count")
b4 <- ggplot(ped_sc, aes(x=Time,
    ydist=distributional::dist_normal(Count, se))) +
  stat_gradientinterval(colour = NA, fill="#20794D", 
    .width=1) +
  geom_line(aes(x=Time, y=Count), colour="#20794D") +
  xlab("Hour") + ylab("Count")
b5 <- ggplot(ped_sc, aes(x=Time, y=Count)) +
  geom_ribbon(aes(ymin = Count - qnorm(0.975)*se, 
                  ymax = Count + qnorm(0.975)*se),
    fill = "#b9ca4a") +
  geom_line(colour="#20794D") +
  xlab("Hour")
ped_sc_ci <- ped_sc |>
  mutate(l50 = Count - qnorm(0.75)*se,
         u50 = Count + qnorm(0.75)*se,
         l80 = Count - qnorm(0.9)*se,
         u80 = Count + qnorm(0.9)*se,
         l99 = Count - qnorm(0.995)*se,
         u99 = Count + qnorm(0.995)*se
  ) |>
  pivot_longer(cols=l50:u99, names_to = "intprob", 
    values_to="value") |>
  mutate(bound = str_sub(intprob, 1, 1),
       prob = str_sub(intprob, 2, 3)) |>
  select(Time, Count, se, prob, bound, value) |>
  pivot_wider(names_from = bound, values_from = value)
b6 <- ggplot(ped_sc_ci, aes(x=Time, y=Count)) +
  geom_lineribbon(aes(ymin = l, ymax = u, fill = prob)) +
  labs(x="Hour", fill="Confidence") +
  scale_fill_discrete_sequential(palette = "Greens", 
    rev=FALSE, n=5)  
b7 <- ggplot(ped_sc, aes(x=Time, y=Count)) +
  geom_smooth(colour = "#20794D", fill = "#b9ca4a") +
  geom_point(colour = "#20794D") +
  xlab("Hour")

Forecasting

This is a forecast of business trips to Melbourne for 2018-2022, based on data from 2000-2017. How is the uncertainty represented?

Code
tourism_melb <- tourism |>
  filter(Region == "Melbourne", Purpose == "Business")
fit <- tourism_melb |>
  model(
    ets = ETS(Trips ~ trend("A"))
  )
fc <- fit |>
  forecast(h = "5 years")
fc |>
  autoplot(tourism_melb) +
    theme(aspect.ratio = 0.6)

An alternative representation is to simulate and display a number of possible forecasts.

Code
fc_b <- fit |>
  forecast(h = "5 years", bootstrap = TRUE)
fc_b_samples <- fit |>
  generate(h = 20, times = 50, bootstrap = TRUE)
fc_b_samples |>
  ggplot() +
    geom_line(aes(x = Quarter, y = .sim, group = .rep),
      colour = "#027EB6", alpha=0.1) +
    geom_line(data=fc_b, aes(x = Quarter, y = .mean), 
      colour = "#027EB6") +
    autolayer(tourism_melb, Trips) +
    ylab("Trips") +
    theme(aspect.ratio = 0.6)

Mapping and data

Map thinning

The big change between working with maps in a GIS and using maps for data analysis is the SIZE of the map data.

We are going to demonstrate how you need to change your approach, using the code in map.R.
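The key step is thinning the polygons before plotting; a sketch of the idea, assuming the sa2 sf object read in the next section:

```r
library(rmapshaper)
# Full-resolution boundaries can be tens of Mb
format(object.size(sa2), units = "Mb")
# Keep ~5% of the vertices but retain every polygon
sa2_thin <- ms_simplify(sa2, keep = 0.05, keep_shapes = TRUE)
format(object.size(sa2_thin), units = "Mb")
```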

Displaying statistics with a map

A choropleth map is used to show a measured variable associated with a political or geographic region. Polygons for the region are filled with colour.

The purpose is to examine the spatial distribution of a variable.

Code
sa2 <- strayr::read_absmap("sa22011") |> 
  filter(!st_is_empty(geometry)) |> 
  filter(!state_name_2011 == "Other Territories") |> 
  filter(!sa2_name_2011 == "Lord Howe Island")
sa2 <- sa2 |> rmapshaper::ms_simplify(keep = 0.5, keep_shapes = TRUE) # Simplify the map!!!
SIR <- read_csv("data/SIR Downloadable Data.csv") |> 
  filter(SA2_name %in% sa2$sa2_name_2011) |> 
  dplyr::select(Cancer_name, SA2_name, Sex_name, p50) |> 
  filter(Cancer_name == "Thyroid", Sex_name == "Females")
ERP <- read_csv("data/ERP.csv") |>
  filter(REGIONTYPE == "SA2", Time == 2011, Region %in% SIR$SA2_name) |> 
  dplyr::select(Region, Value)
# Alternative maps
# Join with sa2 sf object
sa2thyroid_ERP <- SIR |> 
  left_join(sa2, ., by = c("sa2_name_2011" = "SA2_name")) |>
  left_join(., ERP |> 
              dplyr::select(Region, 
              Population = Value), by = c("sa2_name_2011"= "Region")) |> 
  filter(!st_is_empty(geometry))
sa2thyroid_ERP <- sa2thyroid_ERP |> 
  #filter(!is.na(Population)) |> 
  filter(!sa2_name_2011 == "Lord Howe Island") |> 
  mutate(SIR = map_chr(p50, aus_colours)) |> 
  st_as_sf() 
save(sa2, file="data/sa2.rda")
save(sa2thyroid_ERP, file="data/sa2thyroid_ERP.rda")
Code
# Plot the choropleth
load("data/sa2thyroid_ERP.rda")
aus_ggchoro <- ggplot(sa2thyroid_ERP) + 
  geom_sf(aes(fill = SIR), size = 0.1) + 
  scale_fill_identity() + invthm
aus_ggchoro

The problem with choropleth maps

The problem is that high-density population areas may be geographically very small. They can disappear in a choropleth map, which gives a biased sense of the spatial distribution of a variable.

Cartograms

A cartogram transforms each geographic shape to match the value of a statistic, or the population. It's a useful exploratory technique for examining the spatial distribution of a measured variable.



BUT they don’t work for Australia.

Code
# transform to NAD83 / UTM zone 16N
nc <- st_read(system.file("shape/nc.shp", package="sf"), quiet=TRUE)

nc <- nc |>
  mutate(lBIR79 = log(BIR79))
nc_utm <- st_transform(nc, 26916)

orig <- ggplot(nc) + 
  geom_sf(aes(fill = lBIR79)) +
  ggtitle("choropleth") +
  theme_map() +
  theme(legend.position = "none")

nc_utm_carto <- cartogram_cont(nc_utm, weight = "BIR74", itermax = 5)

carto <- ggplot(nc_utm_carto) + 
  geom_sf(aes(fill = lBIR79)) +
  ggtitle("cartogram") +
  theme_map() +
  theme(legend.position = "none")

nc_utm_dorl <- cartogram_dorling(nc_utm, weight = "BIR74")

dorl <- ggplot(nc_utm_dorl) + 
  geom_sf(aes(fill = lBIR79)) +
  ggtitle("dorling") +
  theme_map() +
  theme(legend.position = "none")

orig + carto + dorl + plot_layout(ncol=1)

Hexagon tile

A hexagon tile map represents every spatial polygon with an equal-sized hexagon. In dense areas these are tessellated, but for remote spatial regions separated hexagons are placed at their centroids.


Now the higher thyroid incidence in the Perth suburbs, some Melbourne suburbs, and Sydney is more visible.

Code
if (!file.exists("data/aus_hexmap.rda")) {
  
## Create centroids set
centroids <- sa2 |> 
  create_centroids(., "sa2_name_2011")
## Create hexagon grid
grid <- create_grid(centroids = centroids,
                    hex_size = 0.2,
                    buffer_dist = 5)
## Allocate polygon centroids to hexagon grid points
aus_hexmap <- allocate(
  centroids = centroids,
  hex_grid = grid,
  sf_id = "sa2_name_2011",
  ## same column used in create_centroids
  hex_size = 0.2,
  ## same size used in create_grid
  hex_filter = 10,
  focal_points = capital_cities,
  width = 35,
  verbose = FALSE
)
save(aus_hexmap, 
     file = "data/aus_hexmap.rda")
}

load("data/aus_hexmap.rda")
## Prepare to plot
fort_hex <- fortify_hexagon(data = aus_hexmap,
                            sf_id = "sa2_name_2011",
                            hex_size = 0.2) |> 
            left_join(sa2thyroid_ERP |> select(sa2_name_2011, SIR, p50))
## Make a plot
aus_hexmap_plot <- ggplot() +
  geom_sf(data=sa2thyroid_ERP, fill=NA, colour="grey60", size=0.1) +
  geom_polygon(data = fort_hex, aes(x = long, y = lat, group = hex_id, fill = SIR)) +
  scale_fill_identity() +
  invthm 
aus_hexmap_plot  

Plots and statistical inference

Tidy data and random variables

  • Tidy data mirrors elementary statistics
  • Tabular form puts variables in columns and observations in rows
  • Not all tabular data is in this form
  • In this form, we can think about \(X_1 \sim N(0,1), ~~X_2 \sim \text{Exp}(1) ...\)

\[\begin{align}X &= \left[ \begin{array}{rrrr} X_1 & X_2 & ... & X_p \end{array} \right] \\ &= \left[ \begin{array}{rrrr} X_{11} & X_{12} & ... & X_{1p} \\ X_{21} & X_{22} & ... & X_{2p} \\ \vdots & \vdots & \ddots& \vdots \\ X_{n1} & X_{n2} & ... & X_{np} \end{array} \right]\end{align}\]

Grammar of graphics and statistics

  • A statistic is a function on the values of items in a sample, e.g. for \(n\) iid random variates \(\bar{X}_1=\displaystyle\frac{1}{n}\sum_{i=1}^n X_{i1}\), \(s_1^2=\displaystyle\frac{1}{n-1}\displaystyle\sum_{i=1}^n(X_{i1}-\bar{X}_1)^2\)
  • We study the behaviour of the statistic over all possible samples of size \(n\).
  • The grammar of graphics is the mapping of (random) variables to graphical elements, making plots of data into statistics

Making inference from data

Traditional statistical inference

You need a:

  • statistic computed from the data
  • null and alternative hypothesis
  • reference distribution on which to measure the statistic
    • if it is extreme on this scale, reject the null
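For example, a permutation test builds the reference distribution by shuffling group labels; a self-contained sketch with simulated data (not from the case studies):

```r
set.seed(1)
x <- rnorm(30)                 # group 1
y <- rnorm(30, mean = 0.5)     # group 2
stat_obs <- mean(x) - mean(y)
null_stats <- replicate(1000, {
  z <- sample(c(x, y))         # permute the group labels
  mean(z[1:30]) - mean(z[31:60])
})
# Two-sided p-value: how extreme is the observed statistic
# relative to the reference distribution?
mean(abs(null_stats) >= abs(stat_obs))
```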

Inference with data plots

You need a:

  • plot description provided by the grammar (a statistic)
    • This implies one or more null hypotheses
  • null generating mechanism, e.g. permutation, simulation from a distribution or model
  • visual evaluation: is one plot in the array different?
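The lineups constructed by hand below can also be built with the nullabor package; a sketch, assuming the tb_idn_prop summary computed later (note that null_permute shuffles across all rows, whereas the manual version permutes within each age group):

```r
library(nullabor)
set.seed(733)
# Embed the data plot at a random position among 3 null plots
l <- lineup(null_permute("p"), true = tb_idn_prop, n = 4)
ggplot(l, aes(x = year, y = p)) +
  geom_line() +
  facet_grid(.sample ~ age_group)
```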

Example using TB (1/2)

We made various statements about patterns. Let's take the trend plot comparing the temporal trend of counts for males and females in different age groups. We said:

  • Counts for men are higher across all age groups

We will make comparison plots by simulating sex from a binomial distribution, for each age group and year.

Code
# Need a binary variable to use simulation
tb_idn_long <- tb_idn |>
  filter(year > 2000, year < 2013) |>
  mutate(sex01 = ifelse(sex=="m", 0, 1)) |>
  select(year, age_group, sex01, count) |>
  uncount(count)

# Generate the null samples
set.seed(733)
pos = sample(1:4)  # randomise which panel shows the real data
tb_in_s1 <- tb_idn_long |>
  group_by(year, age_group) |>
  mutate(sex01 = rbinom(n(), 1, 0.5)) |>
  ungroup() |>
  count(year, age_group, sex01) |>
  rename(count = n) |>
  mutate(.sample = pos[1])
tb_in_s2 <- tb_idn_long |>
  group_by(year, age_group) |>
  mutate(sex01 = rbinom(n(), 1, 0.5)) |>
  ungroup() |>
  count(year, age_group, sex01) |>
  rename(count = n) |>
  mutate(.sample = pos[2])
tb_in_s3 <- tb_idn_long |>
  group_by(year, age_group) |>
  mutate(sex01 = rbinom(n(), 1, 0.5)) |>
  ungroup() |>
  count(year, age_group, sex01) |>
  rename(count = n) |>
  mutate(.sample = pos[3])
tb_idn_d <- tb_idn_long |>
  count(year, age_group, sex01) |>
  rename(count = n) |>
  mutate(.sample = pos[4])

# Make the lineup  
tb_l <- bind_rows(tb_idn_d, tb_in_s1, tb_in_s2, tb_in_s3) 
  
ggplot(tb_l, aes(x=year, y=count, colour=factor(sex01))) +
  geom_point() +
  geom_smooth(se=F) + 
  facet_grid(.sample~age_group) +
  ylim(c(0,NA)) + 
  theme(axis.text = element_blank(),
        axis.title = element_blank(),
        legend.position = "none",
        strip.text.x = element_blank())

Example using TB (2/2)

Now let’s check the temporal trend of the proportion of males and females.

We will make comparison plots by randomising the proportion across years for each age group.

Code
set.seed(854)
pos = sample(1:4)  # randomise which panel shows the real data

tb_idn_prop <- tb_idn |> group_by(year, age_group) |> 
  summarise(p = count[sex=="m"]/sum(count)) 

tb_idn_prop_s1 <- tb_idn_prop |>
  group_by(age_group) |>
  mutate(p = sample(p)) |>
  mutate(.sample = pos[1])
tb_idn_prop_s2 <- tb_idn_prop |>
  group_by(age_group) |>
  mutate(p = sample(p)) |>
  mutate(.sample = pos[2])
tb_idn_prop_s3 <- tb_idn_prop |>
  group_by(age_group) |>
  mutate(p = sample(p)) |>
  mutate(.sample = pos[3])
tb_idn_prop <- tb_idn_prop |>
  mutate(.sample = pos[4])

tb_idn_prop_l <- bind_rows(tb_idn_prop, tb_idn_prop_s1, 
  tb_idn_prop_s2, tb_idn_prop_s3)

ggplot(tb_idn_prop_l, aes(x=year, y=p)) +
  geom_hline(yintercept = 0.50, colour="grey50", linewidth=2) +
  geom_point() +
  geom_smooth(se=F) + 
  facet_grid(.sample~age_group) +
  theme(axis.text = element_blank(),
        axis.title = element_blank(),
        legend.position = "none",
        strip.text.x = element_blank())

Inference process

Null-generating mechanisms

  • Permutation: randomising the order of one of the variables breaks association, but keeps the marginal distributions the same
  • Simulation: draw from a given distribution or model; this assumes the data comes from that model
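A minimal base R sketch of why permutation works: shuffling one variable destroys its association with the other, but leaves each marginal distribution unchanged.

```r
set.seed(1)
x <- rnorm(100)
y <- x + rnorm(100, sd = 0.2)   # strongly associated with x
y_null <- sample(y)             # permute y

cor(x, y)                       # close to 1: association present
cor(x, y_null)                  # near zero: association broken
identical(sort(y), sort(y_null))  # TRUE: marginal distribution intact
```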

Evaluation

  • Compute \(p\)-value, essentially following a binomial formula with multiple people reading the plots
  • Power \(=\) signal strength, based on people picking the data plot with one plot design more often than another
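The binomial part of that calculation can be sketched in base R: assuming independent observers, the chance that at least \(x\) of \(K\) observers pick the data plot from a lineup of \(m\) by luck alone is an upper-tail binomial probability. (The function name `p_lineup` is illustrative, not from the source.)

```r
# P(at least x of K observers pick the data plot by chance among m plots)
p_lineup <- function(x, K, m) 1 - pbinom(x - 1, K, 1 / m)
p_lineup(5, 10, 20)   # ~6.4e-05
```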

Using the nullabor package

# Make a lineup of mtcars data
# 20 plots, one data, 19 nulls
# Which one is different?
set.seed(913)
library(ggplot2)
library(nullabor)   # provides lineup() and null_permute()
l <- ggplot(
  lineup(
    null_permute('mpg'), 
    mtcars), 
  aes(mpg, wt)
) +
  geom_point() +
  facet_wrap(~ .sample) +
  theme(axis.text = element_blank(),
        axis.title = element_blank())
# Compute the p-value for 10 observers
pvisual(5, 10, 20)
     x simulated   binom
[1,] 5     2e-04 6.4e-05

Very important

  • \(H_0\) (underlying the null sample generation) is determined based on the plot type

  • \(H_0\) is not based on the structure seen in the data set

  • The null data creation method does not match characteristics of the original sample other than those in \(H_0\)

Perceptual principles

Game: Which plot wears it better?

Coming up: two different plots of 2012 TB incidence in Indonesia, based on the variables sex and age.

  • In arrangement A, separate plots are made for age, and sex is mapped to the x axis.
  • Conversely, in arrangement B, separate plots are made for sex, and age is mapped to the x axis.

At which age(s) are the counts for males and females relatively the same?

Which plot makes this question easier to answer?

Three Variables

Next, we have two different plots based on three variables: age, sex, year.

  • In plot type A, a line plot of counts is drawn separately by age and sex, and year is mapped to the x axis.
  • Conversely, in plot type B, counts are stacked into a bar chart by sex, separately for each age group, and year is mapped to the x axis.

Is the trend for females increasing over time for all age groups?

Which plot makes this easier?

Perceptual principles

  • Hierarchy of mappings
  • Pre-attentive: some elements are noticed before you even realise it.
  • Color palettes: qualitative, sequential, diverging. Make sure the mapping is appropriate.
  • Proximity: Place elements for primary comparison close together.
  • Change blindness: When focus is interrupted differences may not be noticed.
  • Aspect ratio: the ratio of width to height makes some patterns easier or harder to read

Hierarchy of mappings

  1. Position - common scale (BEST)
  2. Position - nonaligned scale
  3. Length, direction, angle
  4. Area
  5. Volume, curvature
  6. Shading, color (WORST)

(Cleveland, 1984; Heer and Bostock, 2009)

  1. scatterplot, barchart
  2. side-by-side boxplot, stacked barchart
  3. piechart, rose plot, gauge plot, donut, wind direction map, starplot
  4. treemap, bubble chart, mosaicplot
  5. chernoff face
  6. choropleth map

Color palettes

  • sequential
  • diverging
  • qualitative

Color Brewer annotates palettes with attributes.

library(RColorBrewer)
display.brewer.all()
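For example, palettes of each type can be pulled by name with `brewer.pal()` (a minimal sketch; the palette names are from the Brewer set):

```r
library(RColorBrewer)
brewer.pal(5, "Blues")   # sequential
brewer.pal(5, "RdBu")    # diverging
brewer.pal(5, "Set1")    # qualitative
```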

Which type of palette is best?

Color blind-proofing

# Default ggplot2 colours (hue_pal() from the scales package);
# d is a plot object defined earlier
library(scales)
clrs <- hue_pal()(9)
d + theme(legend.position = "none")

# Simulate red-green colour blindness with the dichromat package
library(dichromat)
clrs <- dichromat(hue_pal()(9))
d + 
  scale_colour_manual("", values=clrs) + 
  theme(legend.position = "none")
  • Online checking tool coblis: upload an image and it will re-map the colors for different colour perception issues.
  • The package colorblind has color blind friendly palettes (Susan: but the colours are awful 😭).
  • The package colorspace also has good tools for checking that palettes work for colour blind viewers
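The colorspace tools can be sketched as follows (assuming a recent colorspace version, which exports `deutan()`, `protan()` and `tritan()` simulators):

```r
library(colorspace)
pal <- rainbow(6)   # a palette known to be problematic
deutan(pal)         # simulate deuteranopia (red-green deficiency)
protan(pal)         # simulate protanopia
```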

Color deficiency friendly color schemes

Color blind simulation

Original colours

Color blind view

Pre-attentive

Can you find the odd one out?

Pre-attentive

Is it easier now?

Proximity

Place elements that you want to compare close to each other. If there are multiple comparisons to make, you need to decide which one is most important.
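A hedged sketch of this choice (the data frame `df` here is hypothetical): swapping which variable is faceted and which is mapped to the x axis changes which comparison sits closest together.

```r
library(ggplot2)
df <- data.frame(count = c(10, 12, 20, 25),
                 sex   = rep(c("f", "m"), 2),
                 age   = rep(c("0-14", "15-24"), each = 2))

# Arrangement A: sexes side by side within each age panel
# (easiest for comparing sexes at a given age)
pA <- ggplot(df, aes(x = sex, y = count)) +
  geom_col() + facet_wrap(~ age)

# Arrangement B: ages side by side within each sex panel
# (easiest for comparing ages for a given sex)
pB <- ggplot(df, aes(x = age, y = count)) +
  geom_col() + facet_wrap(~ sex)
```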

Mapping and proximity

Same proximity is used, but different geoms.

  • Which is better to determine the relative ratios of males to females by age?

Mapping and proximity

Same proximity is used, but different geoms.

Which is better to determine the relative ratios of ages by sex?

Change blindness

# dsamp is a random sample of the ggplot2 diamonds data,
# e.g. dsamp <- diamonds[sample(nrow(diamonds), 1000), ]
ggplot(dsamp, aes(x=carat, y=price, colour = clarity)) +
  geom_point() +
  geom_smooth(se=FALSE) +
  scale_color_brewer(palette="Set1") +
  facet_wrap(~clarity, ncol=4)

Which has the steeper slope, VS1 or VS2?

Change blindness

Making comparisons across plots requires the eye to jump from one focal point to another.

It may result in not noticing differences.

ggplot(dsamp, aes(x=carat, y=price, 
                  colour = clarity)) +
  geom_point() +
  geom_smooth(se=FALSE) +
  scale_color_brewer(palette="Set1") 

Core principles 1/2

  • Make a plot of your data!
    • The hierarchy matters if the structure is weak or differences between groups are small.
  • Knowing how to use proximity is a valuable and rare skill
  • Use of colour: don’t overuse it
    • Too many colours
    • Mapping a continuous variable to colour to add another dimension

Core principles 2/2

  • Show the data!
    • Statistics are good if there’s too much data
    • Always plot the data for yourself to see the variability
  • One plot is never enough
    • Plot the data in different ways
    • Understand the relationships between variables

Resources